Understanding and Improving Communication Performance in Multi-node LLM Inference

Prajwal Singhania (University of Maryland), Siddharth Singh (University of Maryland), Lannie Dalton Hough (University of Maryland), Akarsh Srivastava (University of Maryland), Harshitha Menon (Lawrence Livermore National Laboratory), Charles Fredrick Jekel (Lawrence Livermore National Laboratory), Abhinav Bhatele (University of Maryland)

System Optimization & Efficiency

A detailed performance study of multi-node distributed LLM inference on GPU clusters that characterizes communication bottlenecks across model-parallel strategies—tensor, pipeline, and sequence parallelism—at scale. The results identify the dominant sources of inter-node communication overhead and provide optimization strategies validated on state-of-the-art inference engines.

Presentation

Talk

Paper Session 3: Systems Efficiency

Wednesday, May 27 · 3:40 PM – 3:50 PM

Bayshore Ballroom

Poster

Wednesday, May 27 · 5:15 PM – 6:45 PM

Carmel / Monterey

Abstract

As large language models (LLMs) continue to grow in size, distributed inference has become increasingly important. Model-parallel strategies must now efficiently scale not only across multiple GPUs but also across multiple nodes. In this work, we present a detailed performance study of multi-node distributed inference using LLMs on GPU-based supercomputers. We conduct experiments with several state-of-the-art inference engines alongside YALIS, a research-oriented prototype engine designed for controlled experimentation. We analyze the strong-scaling behavior of different model-parallel schemes and identify key bottlenecks. Because all-reduce operations are a common performance bottleneck, we develop NVRAR, a hierarchical all-reduce algorithm based on recursive doubling with NVSHMEM. NVRAR achieves up to 1.9×–3.6× lower latency than NCCL for message sizes between 128 KB and 2 MB on HPE Slingshot and InfiniBand interconnects. Integrated into YALIS, NVRAR achieves up to a 1.72× reduction in end-to-end batch latency for the Llama 3.1 405B model in multi-node decode-heavy workloads using tensor parallelism.
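
For readers unfamiliar with recursive doubling, the sketch below simulates the textbook form of the algorithm in plain Python. It is not NVRAR and does not use NVSHMEM or NCCL: ranks are modeled as in-memory lists, the function name `recursive_doubling_allreduce` is hypothetical, and the sketch assumes a power-of-two number of ranks and a sum reduction. It only illustrates why the pattern finishes in log2(p) exchange steps, which is one reason it can suit latency-sensitive message sizes.

```python
def recursive_doubling_allreduce(values):
    """Simulate a sum all-reduce across len(values) ranks.

    Illustrative only: each entry of `values` stands in for one rank's
    local buffer. Assumes the number of ranks is a power of two.
    """
    p = len(values)
    if p == 0 or p & (p - 1):
        raise ValueError("this sketch assumes a power-of-two number of ranks")

    bufs = [list(v) for v in values]  # copy so the inputs are not mutated
    distance = 1
    steps = 0
    while distance < p:
        # In each step, rank r pairs with rank r XOR distance; both sides
        # exchange full buffers and reduce them element-wise.
        new_bufs = []
        for rank in range(p):
            partner = rank ^ distance
            new_bufs.append([a + b for a, b in zip(bufs[rank], bufs[partner])])
        bufs = new_bufs
        distance *= 2
        steps += 1
    return bufs, steps


# Four simulated ranks, each holding a two-element buffer.
ranks = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
result, steps = recursive_doubling_allreduce(ranks)
print(result[0], steps)  # every rank ends with the same reduced buffer
```

With four ranks the loop runs twice (distances 1 and 2), and every simulated rank ends up holding the element-wise sum of all four input buffers.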

ACM CAIS 2026 Sponsors